Abstract. Bitcoin is a decentralized crypto-currency, and an accompanying
protocol, created in 2008. Bitcoin nodes continuously generate and
propagate blocks: collections of newly approved transactions that are
added to Bitcoin’s ledger. Block creation requires nodes to invest
computational resources, but also carries a reward in the form of
bitcoins that are paid to the creator. While the protocol requires nodes
to quickly distribute newly created blocks, strong nodes can in fact gain
higher payoffs by withholding blocks they create and selectively
postponing their publication. The existence of such selfish mining attacks
was first reported by Eyal and Sirer [5], who have demonstrated a specific
deviation from the standard protocol (a strategy that we name SMI).

In this paper we extend the underlying model for selfish mining attacks,
and provide an algorithm to find $\epsilon$-optimal policies for attackers
within the model, as well as tight upper bounds on the revenue of optimal
policies. As a consequence, we are able to provide lower bounds on the
computational power an attacker needs in order to benefit from selfish
mining. We find that the profit threshold - the minimal fraction of
resources required for a profitable attack - is strictly lower than the one
induced by the SMI scheme. Indeed, the policies given by our algorithm
dominate SMI, by better regulating attack-withdrawals.

Our algorithm can also be used to evaluate protocol modifications that 
aim to reduce the profitability of selfish mining. We demonstrate this 
with regard to a suggested countermeasure by Eyal and Sirer, and show 
that it is slightly less effective than previously conjectured. Next, we gain 
insight into selfish mining in the presence of communication delays, and 
show that, under a model that accounts for delays, the profit threshold 
vanishes, and even small attackers have incentive to occasionally deviate 
from the protocol. We conclude with observations regarding the combined
power of selfish mining and double spending attacks.


1 Introduction 

In a recent paper, Eyal and Sirer have highlighted a flaw in the incentive
scheme in Bitcoin. Given that most of the network follows the “standard”
Bitcoin protocol, a single node (or a pool) which possesses enough
computational resources or is extremely well connected to the rest of the
network can increase its expected rewards by deviating from the protocol.
While the standard Bitcoin protocol requires nodes to immediately publish any
block that they find to the rest of the network, Eyal and Sirer have shown
that participants can selfishly increase their revenue by selectively
withholding blocks. Their strategy, which we denote SMI, thus shows that
Bitcoin as currently formulated is not incentive compatible.

On the positive side, SMI (under the model of Eyal and Sirer) becomes
profitable only when employed by nodes that possess a large enough share of
the computational resources, and are sufficiently well connected to the rest
of the network. It is important to note, however, that SMI is not the optimal
best-response to honest behaviour, and situations in which SMI is not
profitable may yet have other strategies that are better than strict adherence
to the protocol. Our goal in this paper is to better understand the conditions
under which Bitcoin is resilient to selfish mining attacks. To this end, we
must consider other possible deviations from the protocol, and establish
bounds on their profitability.

The role of incentives in Bitcoin should not be underestimated: Bitcoin
transactions are confirmed in batches, called blocks, whose creation requires
generating the solution to computationally expensive proof-of-work “puzzles”.
The security of Bitcoin against the reversal of payments (so-called double
spending attacks) relies on having more computational power in the hands of
honest nodes. Block creation (which is also known as mining) is rewarded in
bitcoins that are given to the block’s creator. These rewards incentivize more
honest participants to invest additional computational resources in mining,
and thus support the security of Bitcoin.

When all miners follow the Bitcoin protocol, a single miner’s share of the 
payoffs is equal to the fraction of computational power that it controls (out 
of the computational resources of the entire network). However, selfish mining
schemes allow a strong attacker to increase its revenue at the expense of other 
nodes. This is done by exploiting the conflict-resolution rule of the protocol, 
according to which only one chain of blocks can be considered valid, and only 
blocks on the valid chain receive rewards; the attacker creates a deliberate fork, 
and (sometimes) manages to force the honest network to abandon and discard 
some of its blocks. 

The consequences of selfish mining attacks are potentially destructive to the 
Bitcoin system. A successful attacker becomes more profitable than honest nodes, 
and is able to grow steadily. It may thus eventually drive other nodes out of
the system. Profits from selfish mining increase as more computational power is 
held by the attacker, making its attack increasingly effective, until it eventually 
holds over 50% of the computational resources in the network. At this point, the 
attacker is able to collect all block rewards, to mount successful double spending 
attacks at will, and to block any transaction from being processed (this is known 
as the 50% attack). 

Footnote: This can partly explain why selfish mining attacks have not been
observed in the Bitcoin network thus far.

Footnote: Growth is achieved either by buying more hardware, in the case of a
single attacker, or by attracting more miners, in the case of a pool.



We summarize the contributions of this paper as follows: 

1. We provide an efficient algorithm that computes an $\epsilon$-optimal
selfish mining policy for any $\epsilon > 0$, and for any parametrization of
the model (i.e., one that maximizes the revenue of the attacker up to an error
of $\epsilon$, given that all other nodes are following the standard Bitcoin
protocol). We prove the correctness of our algorithm and analyze its error
bound. We further verify all strategies generated by the algorithm in a
selfish mining simulator that we have designed to this end.

2. Using our algorithm we show that, indeed, there are selfish mining strategies 
that earn more money and are profitable for smaller miners compared to 
SMI. The gains are relatively small (see Fig. 1 below). This can be seen
as a positive result, lower bounding the amount of resources needed for a 
profitable attacker. 

3. Our technique allows us to evaluate different protocol modifications that were 
suggested as countermeasures for selfish mining. We do so for the solution 
suggested by Eyal and Sirer, in which miners that face two chains of equal 
weight choose the one to extend uniformly at random. We show that this 
modification unexpectedly enhances the power of medium-sized attackers, 
while limiting strong ones, and that unlike previously conjectured, attackers 
with less than 25% of the computational resources can still gain from selfish 
mining. 

4. We show that in a model that accounts for the delay of block propagation 
in the network, the threshold vanishes: there is always a successful selfish 
mining strategy that earns more than honest mining, regardless of the size 
of the attacker. 

5. We discuss the interaction between selfish mining attacks and double
spending attacks. We demonstrate how any attacker for which selfish mining is
profitable can execute double spending attacks bearing no costs. This sheds 
light on the security analysis of Satoshi Nakamoto [H], and specifically, on 
the reason that it cannot be used to show high attack costs, and must instead 
only bound the probability of a successful attack. 

Below, we depict the results of our analysis, namely, the revenue achieved by
optimal policies compared to that of SMI, as well as the profit threshold of
the protocol. In the following, $\alpha$ stands for the attacker’s relative
hashrate, and $\gamma$ is a parameter representing the communication
capabilities of the attacker: the fraction of nodes to which it manages to send
blocks first in case of a block race (see Section 2 for more details). Figure 1
depicts the revenue of an attacker under three strategies: Honest mining, which
adheres to the Bitcoin protocol, SMI, and the optimal policies obtained by our
algorithm. The three graphs correspond to $\gamma = 0, 0.5, 1$. We additionally
illustrate the curve of $\alpha/(1 - \alpha)$, which is an upper bound on the
attacker’s revenue, achievable only when $\gamma = 1$ (see Section 3).
Figure 2 depicts the profit threshold for each $\gamma$: if the attacker’s
$\alpha$ is below the threshold then Honest mining is the most profitable
strategy. For comparison, we depict the thresholds induced by SMI as well.




[Figure 1, panel (b): $\gamma = 0.5$; horizontal axis: fraction of hashrate]

Fig. 1. The $\epsilon$-optimal revenue and the computed upper bound, as a
function of the attacker’s hashrate $\alpha$, compared to SMI, honest mining,
and to the hypothetical bound provided in Section 3. The graphs differ in the
attacker’s communication capability, $\gamma$, valued 0, 0.5, and 1. The gains
of the $\epsilon$-optimal policies are very close to the computed upper bound,
except when $\alpha$ is close to 0.5, in which case the truncation-imposed loss
is apparent. See also Table 2.



























Fig. 2. The profit thresholds induced by optimal policies, and by SMI, as a
function of $\gamma$. Thresholds at higher $\gamma$ values match that of SMI
(but still, optimal strategies for these values earn more than SMI, once above
the threshold).


The remainder of the paper is structured as follows: We begin by presenting
our model, based principally on Eyal and Sirer’s [5] (Section 2). Section 3
shows a theoretical bound on the attacker’s revenue. In Section 4 we describe
our algorithm to find optimal policies and values. In Section 5 we discuss more
results, e.g., the optimal policies. Section 6 analyzes selfish mining in
networks with delays. Section 7 discusses the interaction between selfish
mining and double spending. We conclude with discussing related work
(Section 8).

2 Model 

We follow and extend the model of [5], to explicitly consider all actions
available to the attacker at any given point in time.

We assume that the attacker controls a fraction $\alpha$ of the computational
power in the network, and that the honest network thus has a $(1 - \alpha)$
fraction. Communication of newly created blocks is modeled to be much faster
than block creation, so no blocks are generated while others are being
transmitted.

Blocks are created in the network according to a Poisson process with rate
$\lambda$. Every new block is generated by the attacker with probability
$\alpha$, or by the honest network with probability $(1 - \alpha)$. The honest
network follows the Bitcoin protocol, and always builds its newest block on top
of the longest known chain.

Footnote: This is justified by Bitcoin’s 10 minute block creation interval,
which is far greater than the propagation time of blocks in the network. This
assumption is later removed when we consider networks with delay.









Once an honest node adopts a block, it will discard it only if a strictly longer 
competing chain exists. Ties are thus handled by each node according to the 
order of arrival of blocks. Honest nodes immediately broadcast blocks that they 
create. 

Blocks generally form a tree structure, as each block references a single
predecessor (with the exception of the first block that is called the genesis block).
Since the honest nodes adopt the longest chain, blocks generate rewards for their 
creator only if they are eventually part of the longest chain in the block tree (all 
blocks can be considered revealed eventually). 

To model the communication capabilities of the attacker, we assume that
whenever it learns that a block has been released by the network, it is able
to transmit an alternative block which will arrive first at nodes that possess
a fraction $\gamma$ of the computational power of the honest network (the
attacker must have prepared this block in advance in order to be able to
deliver it quickly enough). Thus, if the network is currently propagating a
block of height $h$, and the attacker has a competing block of the same height,
it is able to get $\gamma \cdot (1 - \alpha)$ of the computational power (owned
by honest nodes) to adopt this block.

The attacker does not necessarily follow the Bitcoin protocol. Rather, at any 
given time t, it may choose to invest computational power in creating blocks that 
extend any existing block in history, and may withhold blocks it has created 
for any amount of time. A general selfish mining strategy dictates, therefore, 
two key behaviours: which block the attacker attempts to extend at any time 
t, and which blocks are released at any given time. However, given that all 
block creation events are driven by memoryless processes and that broadcast 
is modeled as instantaneous, any rational decision made by the attacker may 
only change upon the creation of a new block. The mere passage of time without 
block creation does not otherwise alter the expected gains from future outcomes.
Accordingly, we model the entire decision problem faced by an attacker using a 
discrete-time process in which each time step corresponds to the creation of a 
block. The attacker is thus asked to decide on a course of action immediately 
after the creation of each block, and this action is pursued until the next event 
occurs. 

Instead of directly modeling the primitive actions of block extension and
publication on general block trees, we can limit our focus to “reasonable”
strategies where the attacker maintains a single secret branch of blocks that
diverged from the network’s chain at some point. (We show that this limitation
is warranted and that this limited strategy space still generates optimal
attacks in Appendix A.) Blocks before that point are agreed upon by all
participants. Accordingly, we must only keep track of blocks that are after the
fork, and of the accumulated reward up to the fork. We denote by $a$ the number
of blocks that have been built by the attacker after the latest fork, and by
$h$ the number of those built by honest nodes.

Footnote: See Section 6 for the implication of delayed broadcasting.

Formally, if all other participants are following the standard protocol, the
attacker faces a single-player decision problem of the form
$M := \langle S, A, P, R \rangle$, where $S$ is the state space, $A$ the action
space, $P$ a stochastic transition matrix that describes the probability of
transitioning between states, and $R$ the reward matrix. Though similar in
structure, we do not regard $M$ as an MDP, since the objective function is
nonlinear: the player aims to maximize its share of the accepted blocks, rather
than the absolute number of its own accepted ones; its goal is to have a
greater return-on-investment than its counterparts.

Actions. We begin with the description of the action space $A$, which will
motivate the nontrivial construction of the state space.

• Adopt. The action adopt is always feasible, and represents the attacker’s 
acceptance of the honest network’s chain. The a blocks in the attacker’s 
current chain are discarded. 

• Override. The action override represents the publication of the attacker’s 
blocks, and is feasible whenever a > h. 

• Match. This action represents the case where the most recent block was 
built by the honest network, and the attacker now publishes a conflicting 
block of the same height. This action is not always feasible (the attacker 
must have a block prepared in advance to execute such a race). The state
space explicitly encodes the feasibility status of this action (see below).

• Wait. Lastly, the wait action, which is always feasible, implies that the 
attacker does not publish new blocks, but keeps working on its branch until 
a new block is built. 

State Space. The state space, denoted $S$, is defined by 3-tuples of the form
(a, h, fork). The first two entries represent the lengths of the attacker’s
chain and the honest network’s chain, built after the latest fork (that is,
above the most recent block accepted by all). The field fork obtains three
values, dubbed irrelevant, relevant and active. A state of the form
(a, h, relevant) means that the previous state was of the form (a, h - 1, ·);
this implies that if $a \ge h$, match is feasible. Conversely,
(a, h, irrelevant) denotes the case where the previous state was
(a - 1, h, ·), rendering match now ineffective, as all honest nodes have
already received the $h$’th block. The third label, active, represents the case
where the honest network is already split, due to a previous match action; this
information affects the transition to the next state, as described below. We
will refer to states as (a, h) or (a, h, ·), in contexts where the fork label
plays no effective role.
Transition and Reward Matrices. In order to keep the time averaging of
rewards in scale, every state transition corresponds to the creation of a new
block. The initial state $X_0$ is (1, 0, irrelevant) w.p. $\alpha$ or
(0, 1, irrelevant) w.p. $(1 - \alpha)$. Rewards are given as elements in
$\mathbb{N}^2$, where the first entry represents blocks of the attacker that
have been accepted by all parties, and the second one, similarly, for those of
the honest network.

The transition matrix $P$ and reward matrix $R$ are succinctly described in
Table 1. Largely, an adopt action “resets” the game, hence the state following
it has the same distribution as $X_0$; its immediate reward is $h$ in the
coordinate corresponding to the honest network. An override reduces the
attacker’s secret chain by $h + 1$ blocks, which it publishes, and which the
honest network accepts. This bestows a reward of $h + 1$ blocks to the
attacker. The state following a match action depends on whether the next block
is created by the attacker ($\alpha$), by honest nodes working on their branch
of the chain ($(1 - \gamma) \cdot (1 - \alpha)$), or by an honest node which
accepted the sub-chain that the attacker published
($\gamma \cdot (1 - \alpha)$). In the latter case, the attacker has effectively
overridden the honest network’s previous chain, and is awarded $h$ accordingly.

Footnote (relative-revenue objective): Another possible motivation for this is
the re-targeting mechanism in Bitcoin. When the block creation rate in the
network is constant, the adaptive re-targeting implies that the attacker will
also increase its absolute payoff, in the long run.


Table 1. A description of the transition and reward matrices $P$ and $R$ in the
decision problem $M$. The third column contains the probability of transiting
from the state specified in the left-most column, under the action specified
therein, to the state in the second column. The corresponding two-dimensional
reward (the reward of the attacker and that of the honest nodes) is specified
in the right-most column.


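State × Action               | State                    | Probability                          | Reward
-----------------------------+--------------------------+--------------------------------------+-----------
(a, h, ·), adopt             | (1, 0, irrelevant)       | $\alpha$                             | (0, h)
                             | (0, 1, irrelevant)       | $1 - \alpha$                         | (0, h)
-----------------------------+--------------------------+--------------------------------------+-----------
(a, h, ·), override (†)      | (a - h, 0, irrelevant)   | $\alpha$                             | (h + 1, 0)
                             | (a - h - 1, 1, relevant) | $1 - \alpha$                         | (h + 1, 0)
-----------------------------+--------------------------+--------------------------------------+-----------
(a, h, irrelevant), wait     | (a + 1, h, irrelevant)   | $\alpha$                             | (0, 0)
(a, h, relevant), wait       | (a, h + 1, relevant)     | $1 - \alpha$                         | (0, 0)
-----------------------------+--------------------------+--------------------------------------+-----------
(a, h, active), wait         | (a + 1, h, active)       | $\alpha$                             | (0, 0)
(a, h, relevant), match (‡)  | (a - h, 1, relevant)     | $\gamma \cdot (1 - \alpha)$          | (h, 0)
                             | (a, h + 1, relevant)     | $(1 - \gamma) \cdot (1 - \alpha)$    | (0, 0)

(†) feasible only when $a > h$
(‡) feasible only when $a \ge h$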
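Table 1 can be read as a function from a state-action pair to a short list of outcomes. The following is a minimal sketch of that reading, not the authors' simulator; the function name `outcomes` and the string labels are hypothetical choices made for illustration, and it assumes that match is feasible when a ≥ h in a relevant fork, as described in the State Space paragraph.

```python
IRRELEVANT, RELEVANT, ACTIVE = "irrelevant", "relevant", "active"

def outcomes(state, action, alpha, gamma):
    """List the (probability, next_state, reward) triples of Table 1.

    state  -- (a, h, fork): chain lengths since the last fork and the fork label
    reward -- (attacker_blocks, honest_blocks) accepted by all parties
    """
    a, h, fork = state
    if action == "adopt":
        # The attacker gives up its a blocks; the honest chain earns h.
        return [(alpha, (1, 0, IRRELEVANT), (0, h)),
                (1 - alpha, (0, 1, IRRELEVANT), (0, h))]
    if action == "override":
        if a <= h:
            raise ValueError("override is feasible only when a > h")
        # The attacker publishes h + 1 blocks, which the network accepts.
        return [(alpha, (a - h, 0, IRRELEVANT), (h + 1, 0)),
                (1 - alpha, (a - h - 1, 1, RELEVANT), (h + 1, 0))]
    if action == "match":
        if fork != RELEVANT or a < h:
            raise ValueError("match needs fork == relevant and a >= h")
        split = True
    elif action == "wait":
        split = (fork == ACTIVE)  # an earlier match already split the network
    else:
        raise ValueError(f"unknown action: {action!r}")
    if split:
        return [(alpha, (a + 1, h, ACTIVE), (0, 0)),
                (gamma * (1 - alpha), (a - h, 1, RELEVANT), (h, 0)),
                ((1 - gamma) * (1 - alpha), (a, h + 1, RELEVANT), (0, 0))]
    return [(alpha, (a + 1, h, IRRELEVANT), (0, 0)),
            (1 - alpha, (a, h + 1, RELEVANT), (0, 0))]
```

For example, `outcomes((2, 1, "relevant"), "match", 0.3, 0.5)` lists three outcomes whose probabilities sum to one, the middle one being the case where an honest node extends the attacker's published sub-chain.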


Objective Function. As explained in the introduction, the attacker aims to
maximize its relative revenue, rather than its absolute one as usual in MDPs.
Let $\pi$ be a policy of the player; we will write $\pi(a, h, fork)$ for the
action that $\pi$ dictates be taken at state (a, h, fork). Denote by
$X_t^\pi$ the $t$’th state visited under $\pi$, and let
$r(x, y, \pi) = (r^1(x, y, \pi), r^2(x, y, \pi))$ be the immediate reward from
transiting from state $x$ to state $y$, under the action dictated by $\pi$. We
will abbreviate $r_t(X_t^\pi, X_{t+1}^\pi, \pi)$ and write simply $r_t(\pi)$ or
even $r_t$, when context is clear. The objective function of the player is its
relative payoff, defined by

$$REV := \mathbb{E}\left[ \liminf_{T \to \infty} \frac{\sum_{t=1}^{T} r_t^1(\pi)}{\sum_{t=1}^{T} \left( r_t^1(\pi) + r_t^2(\pi) \right)} \right] \qquad (1)$$


We will specify the parameters of REV depending on the context (e.g.,
$REV(\pi, \alpha, \gamma)$, $REV(\pi)$, $REV(\alpha)$), and will occasionally
denote the value of REV by $\rho$. In addition, for full definiteness of REV,
we rule out pathological behaviours in which the attacker waits forever;
formally, the expected time for the next non-null action of the attacker must
be finite.

Honest Mining and SMI. We now define two policies of prime interest to
this paper. Honest mining is the unique policy which adheres to the protocol at
every state. It is defined by

$$\text{honest mining}(a, h, \cdot) := \begin{cases} \text{adopt} & h > a \\ \text{override} & a > h \end{cases} \qquad (2)$$

and wait otherwise. Notice that under our model,
$REV(\text{honest mining}, \alpha, \gamma) = \alpha$ for all $\gamma$. Eyal and
Sirer’s selfish mining strategy, SMI, can be defined as

$$SMI(a, h, \cdot) := \begin{cases} \text{adopt} & h > a \\ \text{match} & h = a = 1 \\ \text{override} & h = a - 1 \ge 1 \\ \text{wait} & \text{otherwise} \end{cases} \qquad (3)$$
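The two policies can be run against the transitions of Table 1 to estimate REV empirically. The sketch below is illustrative only and is not the simulator mentioned in the introduction; the names `honest_mining`, `smi`, `step`, and `estimate_rev`, the step count, and the seed are all assumptions introduced here. It approximates REV by the ratio of accumulated rewards over one long run.

```python
import random

def honest_mining(a, h, fork):
    # Equation (2): adopt when behind, override when ahead, wait otherwise.
    if h > a:
        return "adopt"
    if a > h:
        return "override"
    return "wait"

def smi(a, h, fork):
    # Equation (3).
    if h > a:
        return "adopt"
    if h == a == 1:
        return "match"
    if h == a - 1 and h >= 1:
        return "override"
    return "wait"

def step(state, action, alpha, gamma, rng):
    """Sample (next_state, reward) following Table 1."""
    a, h, fork = state
    u = rng.random()
    if action == "adopt":
        nxt = (1, 0, "irrelevant") if u < alpha else (0, 1, "irrelevant")
        return nxt, (0, h)
    if action == "override":
        assert a > h, "override is feasible only when a > h"
        nxt = (a - h, 0, "irrelevant") if u < alpha else (a - h - 1, 1, "relevant")
        return nxt, (h + 1, 0)
    if action == "match":
        assert fork == "relevant" and a >= h, "match is not feasible here"
    if action == "match" or fork == "active":
        # The honest network is split between the two branches.
        if u < alpha:
            return (a + 1, h, "active"), (0, 0)
        if u < alpha + gamma * (1 - alpha):
            return (a - h, 1, "relevant"), (h, 0)
        return (a, h + 1, "relevant"), (0, 0)
    # Plain wait with no active fork.
    if u < alpha:
        return (a + 1, h, "irrelevant"), (0, 0)
    return (a, h + 1, "relevant"), (0, 0)

def estimate_rev(policy, alpha, gamma, steps=300_000, seed=0):
    """Ratio of attacker rewards to all rewards along one simulated run."""
    rng = random.Random(seed)
    state = (1, 0, "irrelevant") if rng.random() < alpha else (0, 1, "irrelevant")
    attacker = honest = 0
    for _ in range(steps):
        state, (r1, r2) = step(state, policy(*state), alpha, gamma, rng)
        attacker += r1
        honest += r2
    return attacker / (attacker + honest)
```

Under these assumptions, `estimate_rev(honest_mining, 0.3, 0.5)` should land close to 0.3, in line with the remark that honest mining earns $\alpha$, and `estimate_rev(smi, 0.4, 0.0)` should land close to the SMI entry for $\alpha = 0.4$ in Table 2.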


Profit threshold. Keeping the attacker’s connectivity capabilities ($\gamma$)
fixed, we are interested in the minimal $\alpha$ for which employing dishonest
mining strategies becomes profitable. We define the profit threshold by:

$$\hat{\alpha}(\gamma) := \inf_{\alpha} \left\{ \exists \pi \in A : REV(\pi, \alpha, \gamma) > REV(\text{honest mining}, \alpha, \gamma) \right\}. \qquad (4)$$


3 A Simple Upper Bound 

The mechanism implied by the longest-chain rule leads to an immediate bound 
on the attacker’s relative revenue. Intuitively, we observe that the attacker cannot 
do better than utilizing every block it creates to override one block of the honest 
network. The implied bound is provided here merely for general insight—it is 
usually far from the actual maximal revenue. 

Proposition 1. For any $\pi$, $REV(\pi, \alpha, \gamma) \le \frac{\alpha}{1 - \alpha}$.
Moreover, this bound is tight, and achieved when $\gamma = 1$.

See Appendix B for the proof.
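The intuition behind the bound can be written as a short back-of-the-envelope calculation. This is only a heuristic sketch and not the proof from the appendix; $N$ (a large number of created blocks), $A$ and $H$ (the accepted attacker and honest blocks among them) are symbols introduced here for illustration.

```latex
% Of N created blocks, roughly \alpha N are the attacker's and (1-\alpha) N are honest.
% Each accepted attacker block displaces at most one honest block.
\begin{align*}
A &\le \alpha N, \qquad H \ge (1-\alpha) N - A, \\
\frac{A}{A + H} &\le \frac{A}{(1-\alpha) N} \le \frac{\alpha N}{(1-\alpha) N}
  = \frac{\alpha}{1-\alpha}.
\end{align*}
```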


4 Solving for the Optimal Policy 

Finding an optimal policy is not a trivial task, as the objective function (1)
is nonlinear, and depends on the entire history of the game. To overcome this
we introduce the following method. We assume first that the optimal value of
the objective function is $\rho$, then construct an infinite un-discounted
average reward MDP (with “standard” linear rewards), compute its optimal policy
(using standard MDP solution techniques), and if the reward of this policy is
zero then it is optimal also in the original decision problem $M$. We elaborate
on this approach below.

Footnote: Indeed, in networks without delay, honest mining is equivalent to the
policy { adopt if (a, h) = (0, 1) ; override if (a, h) = (1, 0) }, as these are
the only reachable states. Delays allow other states to be reached, and will be
covered in Section 6.




4.1 Method 


For any $\rho \in [0, 1]$, define the transformation
$w_\rho : \mathbb{N}^2 \to \mathbb{R}$ by
$w_\rho(x, y) := (1 - \rho) \cdot x - \rho \cdot y$. Define the MDP
$M_\rho := \langle S, A, P, w_\rho(R) \rangle$; it shares the same state space,
actions, and transition matrix as $M$, while $M$’s immediate rewards matrix is
transformed according to $w_\rho$. For any admissible policy $\pi$ denote by
$v_\rho^\pi$ the expected mean revenue under $\pi$, namely,

$$v_\rho^\pi = \mathbb{E}\left[ \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} w_\rho(r_t(\pi)) \right], \qquad (5)$$

and by

$$v_\rho^* = \max_{\pi \in A} \left\{ v_\rho^\pi \right\} \qquad (6)$$

the value of $M_\rho$. Our solution method is based on the following
proposition:

Proposition 2. 1. If for some $\rho \in [0, 1]$, $v_\rho^* = 0$, then any
policy $\pi^*$ obtaining this value (thus maximizing $v_\rho^\pi$) also
maximizes REV, and $\rho = REV(\pi^*)$.

2. $v_\rho^*$ is monotonically decreasing in $\rho$.

Following these observations we can utilize the family $M_\rho$ to obtain an
optimal policy: We perform a simple search for a $\rho$ such that the optimal
solution of $M_\rho$ has a value of 0. Since $v_\rho^*$ is monotonically
decreasing, this search can be done efficiently, using binary search. In
practice, our algorithm relies on a variation of Proposition 2, which will be
proven formally in Appendix C.
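To see why a zero value pins down the relative revenue, consider one fixed policy $\pi$. The calculation below is informal and is not the argument of the appendix: it assumes, for illustration only, that the long-run per-step averages $\bar r^1$ and $\bar r^2$ of the attacker's and the honest network's rewards exist and that $\bar r^1 + \bar r^2 > 0$. These two symbols are not part of the paper's notation.

```latex
\begin{align*}
v_\rho^\pi &= (1-\rho)\,\bar r^1 - \rho\,\bar r^2
            = \bar r^1 - \rho\,(\bar r^1 + \bar r^2), \\
v_\rho^\pi = 0 &\iff \rho = \frac{\bar r^1}{\bar r^1 + \bar r^2}, \\
\frac{\partial v_\rho^\pi}{\partial \rho} &= -(\bar r^1 + \bar r^2) < 0 .
\end{align*}
```

Under these assumptions the root of $v_\rho^\pi$ is exactly the share of accepted blocks earned by $\pi$, and each $v_\rho^\pi$ decreases in $\rho$, which is consistent with both parts of the proposition.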

Due to the fact that the search domain is continuous, practically, one would 
need to halt the search at a point that is sufficiently close to the actual value, 
but never exact. Moreover, in practice, MDP solvers can solve only finite state 
space MDPs, and even then only to a limited degree of accuracy. Our algorithm 
copes with these computational limitations by using finite MDPs as bounds to 
the original problem, and by analyzing the potential error that is due to inexact 
solutions. 


4.2 Translation to Finite MDPs 

We now introduce two families of MDPs, closely related to the family
$M_\rho$. Fix some $T \in \mathbb{N}$. We define an under-paying MDP,
$M_\rho^T$, which differs from $M_\rho$ only in states where
$\max\{a, h\} = T$, in which it allows only for the adopt action. We denote
this modified action space by $A^T$. Clearly, the player’s value in $M_\rho^T$
lower bounds that in $M_\rho$, since in the latter the attacker might be able
to do better by not adopting in the truncating states. Consequently, this MDP
can only be used to upper bound the threshold (in a way described below).
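The truncated action space can be sketched as a small helper that combines the feasibility rules of Section 2 with the forced adopt at the truncation boundary. The function name `allowed_actions` is hypothetical, and using `>=` rather than `==` at the boundary is a defensive choice made here, since states beyond $T$ are not reachable in the truncated process.

```python
def allowed_actions(state, T):
    """Actions available at `state` in the truncated (under-paying) MDP."""
    a, h, fork = state
    if max(a, h) >= T:
        return ["adopt"]          # truncating states allow only adopt
    actions = ["adopt", "wait"]   # always feasible
    if a > h:
        actions.append("override")
    if fork == "relevant" and a >= h:
        actions.append("match")
    return actions
```

For instance, with `T = 75` the state `(3, 3, "relevant")` admits adopt, wait, and match, while `(75, 2, "irrelevant")` admits only adopt.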

Footnote: The equivalence of this formalization of the value function and
alternatives in which the order of expectation and limit is reversed is
discussed in [4].

To complete the picture we need to bound the optimal value from above, and we
do so by constructing an over-paying MDP, $N_\rho^T$. This MDP shares the same
constraint as $M_\rho^T$, yet it compensates the attacker in the states where
$\max\{a, h\} = T$, by granting it a reward greater than what it could have
gotten in the un-truncated process: When $T = a > h$, the attacker is awarded

$$(1 - \rho) \cdot \left( \frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{1}{2} \cdot \left( \frac{a - h}{1 - 2 \cdot \alpha} + a + h \right) \right) \qquad (7)$$

On the other hand, when $T = h > a$, it is awarded

$$\left( 1 - \left( \frac{\alpha}{1 - \alpha} \right)^{h - a} \right) \cdot (-\rho \cdot h) + \left( \frac{\alpha}{1 - \alpha} \right)^{h - a} \cdot (1 - \rho) \cdot \left( \frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{h - a}{1 - 2 \cdot \alpha} \right)$$

Denote by $v_\rho^{T*}$ and $u_\rho^{T*}$ the average-sum optimal values of the
under-paying $M_\rho^T$ and the over-paying $N_\rho^T$, respectively (i.e., the
expected liminf of the average value, for the best policy in $A^T$, similar to
(5)-(6)). The following proposition formalizes the bounds provided by the
over-paying and under-paying MDPs:

Proposition 3. For any $T \in \mathbb{N}$, if $v_\rho^* \ge 0$ then
$u_\rho^{T*} \ge v_\rho^* \ge v_\rho^{T*}$. Moreover, these bounds are tight:
$\lim_{T \to \infty} u_\rho^{T*} - v_\rho^{T*} = 0$.
^ P P 


The proof is deferred to the appendix. Having introduced these MDP families,
we are now ready to present an algorithm which utilizes them to obtain upper
and lower bounds on the attacker’s profit.


4.3 Algorithm 
Algorithm 1

Input: $\alpha$ and $\gamma$, a truncation parameter $T_0 \in \mathbb{N}$, and
error parameters $0 < \epsilon < 8 \cdot \alpha$, $0 < \epsilon' < 1$

 1. low ← 0, high ← 1
 2. do
 3.     $\rho$ ← (low + high)/2
 4.     ($\pi$, v) ← mdp_solver($M_\rho^{T_0}$, $\epsilon/8$)
 5.     if (v > 0)
 6.         low ← $\rho$
 7.     else
 8.         high ← $\rho$
 9. while (high - low ≥ $\epsilon/8$)
10. lower-bound ← ($\rho - \epsilon$)
11. lower-bound-policy ← $\pi$
12. $\rho'$ ← max{low - $\epsilon/4$, 0}
13. ($\pi$, u) ← mdp_solver($N_{\rho'}^{T_0}$, $\epsilon'$)
14. upper-bound ← ($\rho' + 2 \cdot (u + \epsilon')$)









The algorithm initializes the search segment to be [0, 1] (line 1) and begins
a binary search: $\rho$ is assigned the middle point of the search segment
(line 3), and the algorithm outputs an $\epsilon/8$-optimal policy of
$M_\rho^{T_0}$ and its value (line 4). The loop halts if the size of the search
segment is smaller than $\epsilon/8$. Otherwise, it restricts the search to the
upper half of the segment, if the value is positive (line 6), or to the lower
half, in case it is negative (line 8). This essentially represents a binary
search for an approximate root of $v_\rho^{T_0 *}$, which is a monotonically
decreasing function of $\rho$. The algorithm outputs $(\rho - \epsilon)$ as a
lower bound on the player’s relative revenue, and $\pi$ as an
$\epsilon$-optimal policy. These assertions are formalized in the proposition
below:

Proposition 4. For any $T_0 \in \mathbb{N}$ and $\epsilon > 0$, Algorithm 1
halts, and its output $(\rho, \pi)$ satisfies: $|\rho - REV(\pi)| < \epsilon$
and $|\rho - \max_{\pi' \in A^{T_0}} \{REV(\pi')\}| < \epsilon$.

The second part of the algorithm (lines 12-14) computes an $\epsilon'$-optimal
policy for the over-paying MDP $N_{\rho'}^{T_0}$, for
$\rho' = (low - \epsilon/4)^+$ (using the value assigned last to low). If $u$
is the outputted value, the algorithm returns
$\rho' + 2 \cdot (u + \epsilon')$ as an upper bound to the player’s revenue
(line 14).

Proposition 5. If $u$ and $\rho'$ are the outcome of the computation in
Algorithm 1, lines 12-14, then
$\rho' + 2 \cdot (u + \epsilon') > \max_{\pi' \in A} \{REV(\pi')\}$.

Both propositions are proved in Appendix C.
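The search loop in lines 1-9 of the algorithm can be sketched in a few lines. In this illustration `solve_value` is a hypothetical stand-in for the MDP solver: instead of solving the truncated MDP, the toy function below returns the value of a single fixed policy whose per-step average rewards `r1` and `r2` are made-up numbers, so the root is known in advance to be `r1 / (r1 + r2)`.

```python
def binary_search_rho(solve_value, eps):
    """Lines 1-9 of Algorithm 1, with `solve_value` standing in for mdp_solver."""
    low, high = 0.0, 1.0
    while True:                     # emulates the do ... while loop
        rho = (low + high) / 2
        v = solve_value(rho)
        if v > 0:
            low = rho               # value still positive: the root lies above rho
        else:
            high = rho              # value non-positive: the root lies at or below rho
        if high - low < eps / 8:
            break
    return rho, low

def toy_value(rho, r1=0.3, r2=0.6):
    # (1 - rho) * r1 - rho * r2: decreasing in rho, zero at r1 / (r1 + r2).
    return (1 - rho) * r1 - rho * r2

rho, low = binary_search_rho(toy_value, eps=1e-4)
```

With the toy value function the returned `rho` lies within `eps / 8` of 1/3. In the actual algorithm each call would instead solve the under-paying MDP, and the solver's own error is what leads to the wider $\epsilon$ margin in the lower bound.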


4.4 Profit threshold Calculation 

The threshold $\hat{\alpha}(\gamma)$ marks the minimal computational power an
attacker needs in order to gain more than its fair share (see Section 2). It is
crucial in assessing the system’s resilience: An attacker above the threshold
is able to receive increased returns on its investment, to grow steadily in
resources, and eventually to push other nodes out of the game. The system is
safe against such a destructive dynamic if all miners hold less than
$\hat{\alpha}(\gamma)$ of the computational power.

Fix $\gamma$. A simple method allows us to lower bound the threshold: We first
modify the action space of the over-paying MDP so as to disable the option of
honest mining; technically, this is done by removing override from the feasible
actions in (1, 0) and then, separately, removing adopt in (0, 1). We then solve
this modified MDP for some $\alpha$, error parameter $\epsilon$, and truncation
$T$. If the mdp_solver returns a value smaller than $(-\epsilon)$ (both for
when override is disabled in (1, 0) and when adopt is disabled in (0, 1)), we
are assured that honest mining is optimal in the original setup. We perform a
search for the maximal $\alpha$ satisfying this requirement, i.e.,
$\hat{\alpha}(\gamma)$, in a fashion similar to the search in Algorithm 1.

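Corollary 6. Fix $\gamma$ and $\alpha$. If $u$ is the value returned by
mdp_solver for this modified over-paying MDP with error parameter $\epsilon$,
and $u < -\epsilon$, then honest mining is optimal for $\alpha$. In other
words, $\hat{\alpha}(\gamma) \ge \alpha$.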




5 Results 


5.1 Optimal Values 

We ran Algorithm 1 for $\gamma$ from {0, 0.5, 1}, with various values of
$\alpha$, using an MDP solver for MATLAB (an implementation of the relative
value iteration algorithm developed by Chades et al.). The error parameter
$\epsilon$ was set to be 10 ® and the truncation was set to $T = 75$. The
values of $\rho$ returned by the algorithm, for $\gamma = 0, 0.5, 1$, are
depicted in Figure 1 above. Additionally, some values for $\gamma = 0$ appear
in Table 2, computed for parameters $T = 95$ and $\epsilon$ = 10“®. The results
demonstrate a rather mild gap between the attacker’s optimal revenue and the
revenue of SMI. In addition, the graphs depict the upper bound on the revenue
provided in Section 3; as we stated there, the bound is obtained when
$\gamma = 1$, which is observed clearly in the corresponding graph.

Table 2. The revenue of the attacker under SMI and under the $\epsilon$-OPT
policies, compared to the computed upper bound, for various $\alpha$ and with
$\gamma = 0$.

$\alpha$ | SMI     | $\epsilon$-OPT | Upper-Bound
---------+---------+----------------+------------
1/3      | 1/3     | 0.33705        | 0.33707
0.35     | 0.36650 | 0.37077        | 0.37079
0.375    | 0.42118 | 0.42600        | 0.42604
0.4      | 0.48372 | 0.48866        | 0.48904
0.425    | 0.55801 | 0.56808        | 0.57226
0.45     | 0.65177 | 0.66891        | 0.70109
0.475    | 0.78254 | 0.80172        | 0.90476
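The SMI column of Table 2 can be cross-checked against the closed-form revenue that Eyal and Sirer derive for their strategy. That formula and the threshold $(1 - \gamma)/(3 - 2\gamma)$ come from their analysis and are not restated in this text, so the sketch below treats them as external assumptions; the function names are hypothetical. The table rows are copied from Table 2.

```python
def smi_revenue(alpha, gamma):
    """Closed-form relative revenue of SMI (Eyal and Sirer's formula, assumed here)."""
    num = alpha * (1 - alpha) ** 2 * (4 * alpha + gamma * (1 - 2 * alpha)) - alpha ** 3
    den = 1 - alpha * (1 + (2 - alpha) * alpha)
    return num / den

def smi_threshold(gamma):
    """Smallest alpha at which SMI breaks even with honest mining (assumed formula)."""
    return (1 - gamma) / (3 - 2 * gamma)

# Rows of Table 2 (gamma = 0): alpha, SMI, eps-OPT, upper bound.
TABLE_2 = [
    (1 / 3, 1 / 3, 0.33705, 0.33707),
    (0.35, 0.36650, 0.37077, 0.37079),
    (0.375, 0.42118, 0.42600, 0.42604),
    (0.4, 0.48372, 0.48866, 0.48904),
    (0.425, 0.55801, 0.56808, 0.57226),
    (0.45, 0.65177, 0.66891, 0.70109),
    (0.475, 0.78254, 0.80172, 0.90476),
]

def max_smi_deviation():
    """Largest gap between the closed form and the SMI column of Table 2."""
    return max(abs(smi_revenue(alpha, 0.0) - smi) for alpha, smi, _, _ in TABLE_2)
```

Under these assumptions the closed form agrees with the tabulated SMI values up to rounding, and `smi_threshold(0.0)` equals 1/3, the row at which SMI merely matches honest mining while the $\epsilon$-OPT policy already earns slightly more.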


5.2 Optimal Policies 

Below we illustrate two examples of the behaviour of the $\epsilon$-optimal
policies returned by the algorithm. The policies are described by tables, with
the row index corresponding to a and the columns to h. The table-entry (a, h)
contains three characters, specifying the actions to be taken in states
(a, h, irrelevant), (a, h, relevant), and (a, h, active) correspondingly.
Table 3 contains a description of an optimal policy, for an attacker with
$\alpha = 0.45$, $\gamma = 0.5$. Table 4 describes optimal actions for the
setup $\alpha = 1/3$, $\gamma = 0$. Notice that in the latter the match action
is irrelevant, which allows us to regard in the second table only states with
fork = irrelevant. In both tables only a subset of the states is depicted, the
whole space being infinite.

To illustrate how Table 3 should be read, consider entry (a, h) = (3, 3), for 
instance. The string “wm*” in this entry reads: “in case a fork is irrelevant 
(that is, the previous state was (2, 3)), wait; in case it is relevant (the previous 
state was (3, 2)), match; the case where a fork is already active is not reachable”. 
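
The reading rule above can be written as a small decoder. This is only an illustrative sketch: the names `ACTIONS`, `FORK_STATES`, and `decode_entry` are introduced here and do not come from the paper.

```python
# One character per fork state, in the order used by the policy tables.
FORK_STATES = ("irrelevant", "relevant", "active")

# Character codes from the table captions; '*' marks an unreachable state.
ACTIONS = {"a": "adopt", "o": "override", "m": "match", "w": "wait", "*": None}


def decode_entry(entry):
    """Map a three-character table entry to {fork state: action or None}."""
    if len(entry) != len(FORK_STATES):
        raise ValueError("a policy entry must have exactly three characters")
    return {fork: ACTIONS[ch] for fork, ch in zip(FORK_STATES, entry)}


# Entry (a, h) = (3, 3) of Table 3, as discussed in the text.
print(decode_entry("wm*"))
```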













Table 3. The optimal policy for an attacker with α = 0.45 and γ = 0.5, for states 
(a, h, ·) with a, h < 8. The rows index the attacker’s chain length (a), and the columns 
the honest network’s (h). The three characters in each entry represent the action to be 
taken if fork = irrelevant, relevant, or active. ‘a’, ‘o’, ‘m’, and ‘w’ stand for adopt, 
override, match, and wait, respectively, while ‘*’ represents an unreachable state. 

Table 4. The optimal policy for an attacker with α = 0.35 and γ = 0. The table 
describes the actions only for states of the form (a, h, irrelevant) with a, h < 8. (See 
previous caption.) 

Looking into these optimal policies we see they differ from SMI in two ways: 
First, they defer using adopt in the upper triangle of the table, if the gap between 
h and a is not too large, allowing the attacker to “catch up from behind”. 
Thus, apart from block withholding, an optimal attack may also contain another 
feature: attempting to catch up with the longer public chain from a disadvantage. 
This implies that the attacker violates the longest-chain rule, a result which 
counters the claim that the longest-chain rule forms a Nash equilibrium (see [11] 
and the discussion in Section 8). 

Secondly, they utilize match more extensively, effectively overriding the honest 
network’s chain (with probability γ) using one block less. 

5.3 Thresholds 

Following the method described in Section 4.4, we are able to introduce lower 
bounds for the profit thresholds. Figure 2 depicts the thresholds induced by 
optimal policies, compared to that induced by SMI. The results demonstrate 
some reduction of the thresholds when policies other than SMI are considered. 
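
To make the notion of a profit threshold concrete, the sketch below locates by bisection the smallest α at which a revenue function exceeds α (the revenue of honest mining). It is applied here only to SMI, using the closed-form revenue from Eyal and Sirer's analysis, which is an outside formula rather than something stated in this paper. It does not reproduce the optimal-policy thresholds of Figure 2, which require the MDP machinery. The function names are illustrative, and the bisection assumes the advantage over honest mining changes sign at most once on (0, 0.5).

```python
def smi_revenue(alpha, gamma):
    """Closed-form relative revenue of SMI (from Eyal and Sirer's analysis)."""
    num = alpha * (1 - alpha) ** 2 * (4 * alpha + gamma * (1 - 2 * alpha)) - alpha ** 3
    den = 1 - alpha * (1 + (2 - alpha) * alpha)
    return num / den


def profit_threshold(revenue, gamma, tol=1e-9):
    """Bisection for the smallest alpha in (0, 0.5) with revenue(alpha) > alpha.

    Assumes revenue(alpha, gamma) - alpha changes sign at most once on (0, 0.5).
    """
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if revenue(mid, gamma) > mid:
            hi = mid   # deviating is already profitable at mid
        else:
            lo = mid   # honest mining is still at least as good
    return hi


for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: SMI threshold ~ {profit_threshold(smi_revenue, gamma):.4f}")
```

For SMI the result can be compared with the closed-form threshold (1 - γ)/(3 - 2γ) reported by Eyal and Sirer.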

5.4 Evaluation of Protocol Modifications 

Several protocol modifications have been suggested to counter selfish mining 
attacks. It is important to provably verify the merit of such suggestions. This can 
be done by adapting our algorithm to the MDPs induced by these modifications. 
Below we demonstrate this with respect to the rule suggested by Eyal and Sirer. 
According to the Bitcoin protocol, a node which receives a chain of length equal 
to that of the chain it currently adopts ought to reject the new chain. Eyal and 
Sirer suggest instructing nodes to accept the new chain with probability 1/2. We 
refer to this rule below as “uniform tie breaking”. 

The immediate effect of this modification is that it restricts the efficiency of 
the match action to 1/2, even when the attacker’s communication capabilities 
correspond to γ > 1/2. Admittedly, this limits the power of strongly communicating 
attackers, and thus guarantees a positive lower bound on the threshold 
for profitability of SMI (which was 0, when γ = 1). On the other hand, it has the 
apparent downside of enhancing the power of poorly communicating attackers, 
that is, it allows an attacker to match with a success-probability of 1/2 even if its 
“real” γ is smaller than 1/2. 

Unfortunately, our results show that this protocol modification enhances the 
profit of some attackers from deviations. For example, by applying Algorithm 1 to the 
setup induced by uniform tie breaking, we found that attackers in the range 
{γ = 0.5, 0.2321 < α < 0.5} benefit from this modification. In particular, the 
profit threshold deteriorates from 0.25 to 0.2321. Figure 3 demonstrates this 
by comparing the attacker’s optimal revenue under the uniform tie breaking 
protocol with the optimal revenue under the original protocol. The dominating 
policy is described in Table 5. 

The intuition behind this result is as follows: Under uniform tie breaking, two 
chains of equal length will be mined on equally, regardless of the passage of time 
between the transmission of their last blocks. This allows an attacker to perform 
match even if it did not have a block prepared in advance, thereby granting it 
additional chances to catch up from behind. Deviation from the longest-chain 
rule thus becomes even more tempting. 

Fig. 3. The attacker’s optimal revenue under uniform tie breaking, compared to that 
under the original protocol (with γ = 1/2) and to honest mining. 


5.5 Simulations 

In order to verify the results above we built a selfish mining simulator, which we 
implemented in Java. We ran the simulator for various values of α and γ (as 
in the figures above), where the attacker follows the policies generated by the 
algorithm. Each run was performed for 10^7 rounds (block creation events). The 
relative revenue of the attacker matched the revenues returned by the algorithm, 
up to an error of at most ±10^{-6}. 
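
The following is a minimal Monte Carlo sketch in the same spirit, written in Python rather than Java. It is not the authors' simulator: it hard-codes the SMI strategy as a lead-based state machine (following Eyal and Sirer's description) instead of replaying the policies produced by the algorithm, and it checks the estimate against Eyal and Sirer's closed form. All function names are illustrative.

```python
import random


def smi_revenue(alpha, gamma):
    """Closed-form relative revenue of SMI (from Eyal and Sirer's analysis)."""
    num = alpha * (1 - alpha) ** 2 * (4 * alpha + gamma * (1 - 2 * alpha)) - alpha ** 3
    den = 1 - alpha * (1 + (2 - alpha) * alpha)
    return num / den


def simulate_smi(alpha, gamma, rounds, seed=0):
    """Estimate SMI's relative revenue over `rounds` block creation events."""
    rng = random.Random(seed)
    attacker = honest = 0   # blocks credited in the main chain
    lead = 0                # number of blocks the attacker is withholding
    tie = False             # True while two published branches of length 1 compete
    for _ in range(rounds):
        attacker_mines = rng.random() < alpha
        if tie:
            if attacker_mines:
                attacker += 2            # attacker extends its own branch and wins
            elif rng.random() < gamma:
                attacker += 1            # honest block lands on the attacker's branch
                honest += 1
            else:
                honest += 2              # honest branch wins the race
            tie = False
        elif attacker_mines:
            lead += 1                    # keep the new block secret
        elif lead == 0:
            honest += 1                  # nothing withheld: honest block is accepted
        elif lead == 1:
            lead, tie = 0, True          # publish the withheld block (match)
        elif lead == 2:
            attacker += 2                # publish both blocks and override
            lead = 0
        else:
            attacker += 1                # publish one block, stay safely ahead
            lead -= 1
    return attacker / (attacker + honest)


estimate = simulate_smi(alpha=0.4, gamma=0.5, rounds=200_000, seed=1)
print(f"simulated={estimate:.4f}  closed form={smi_revenue(0.4, 0.5):.4f}")
```

With far fewer rounds than the 10^7 used in the paper, agreement is only expected to within sampling noise.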

6 A Model that Considers Delays 

Table 5. The optimal policy for an attacker with α = 0.25, under the “50-50” protocol 
modification suggested in [9]. Only states (a, h, ·) with a, h < 8 are depicted. This policy 
outperforms honest mining. 

So far, our model assumed that no new block is created until all preceding 
published blocks arrived at all nodes. In reality, there are communication delays 
between nodes in the network, including between the attacker and others. Thus, 
instead of modeling the attacker’s communication capabilities via the parameter 
γ, it may be better to consider the non-negligible effect of network 
latency directly. Delays are especially noticeable when the system’s throughput 
is increased by allowing larger blocks to form or by increasing block creation 
rates (see [?]). While this makes the encoding of the game rather complicated, 
a priori we can make the following observations: 

1. The attacker has only a partial knowledge of the world state. Furthermore, 
blocks which it publishes may arrive at the honest network too late, which 
potentially reduces the benefit of block withholding. 

2. Natural forks occur within the honest network, and consequently its chain 
grows at a rate lower than one block per round; this potentially makes attacks 
more successful. 

3. Natural forks involving the attacker imply that the game arrives at nontrivial 
states, even under honest mining. The attacker may thus mine honestly 
until some particular deviation becomes feasible. 

4. In the presence of delays, the attacker’s share under honest mining might be 
greater than α, which raises the bar for dishonest strategies to prevail (see [13] 
for one result quantifying this effect). 

The overall effect of the above cannot be determined without knowing the 
topology of the network and the attacker’s location in it (as well as its knowledge 
about the topology). Still, some insight is possible. Following the third 
observation above, we notice that a dishonest policy π is one in which 
π(a, h) ≠ adopt for some h > a, and/or π(a, h) ≠ override for some a > h (whereas 
under no delays honesty in (1, 0) and (0, 1) suffices). We claim that, consequently, 
the profit threshold equals 0. In other words, every attacker benefits from some 
form of dishonest mining. 

Claim 7. When the network suffers some delays, the attacker has a strict better-
response strategy to honest mining, for any α > 0. 






















Below we provide a proof sketch, which contains the gist of the claim while 
avoiding the involved formalization of the process under delays. We do mention 
that the rewards in M_ρ are now given by the expected outcome of future events, 
specifically the resolution of future conflicts. For instance, following an override 
action in (a, h), the attacker is awarded (1 − ρ)·(h + 1) times the probability 
that its block will be accepted by all nodes, eventually. 


Proof (sketch). Fix k ∈ N, and let π_k be the policy in which the attacker mines 
honestly, until some state (k − 1, k) is reached (and observable to it). At that point, 
instead of adopting, the attacker tries to catch up from behind, until it either 
succeeds (and then it overrides) or it learns of another block of the honest 
network (and then it adopts). Formally, 

$$
\pi_k(a,h) := \begin{cases}
\text{override} & a > h \\
\text{adopt} & h > a \,\wedge\, h \neq k \\
\text{wait} & \text{otherwise}
\end{cases}
$$

Since π_k is stationary, we can analyze its long-term earnings in M_ρ following the 
result of Lemma 8. 

Denote ρ_h := REV(honest mining) and ρ_k := REV(π_k). Upon reaching 
(k − 1, k), the attacker’s immediate reward under honest mining is (−ρ_h · k). On 
the other hand, if it follows π_k, its expected immediate rewards are at least 

$$
q\cdot(1-\rho_k)\cdot(k+1) \;-\; (1-q)\cdot\rho_k\cdot(k+1), \tag{8}
$$


where q is a lower bound on the probability that it will succeed in bypassing the 
honest network’s chain and overriding it in time. The positive term in (8) 
corresponds to the case where the scheme ends successfully (with an override), and 
the negative one to the complementary scenario. To avoid dependencies on k, q 
can be taken to equal 

$$
\int_0^\infty\!\!\int_0^\infty (\alpha\lambda)^2\, e^{-\alpha\lambda\,(t+s)} \cdot e^{-(1-\alpha)\lambda\,(d_{h,a} + t + s + d_{a,h})}\; dt\, ds, \tag{9}
$$

where d_{h,a} is the communication delay on the link from the honest cluster to the 
attacker, and d_{a,h} the delay on the reversed link (for simplicity, we assume that 
in both directions there are single links connecting these parties to one another). 
Indeed, the integrand above represents the probability that the next two blocks 
of the attacker will take a time of t + s to be generated ((αλ)² · e^{−αλ(t+s)}), and 
that the honest network has not been able to create a block since the beginning of 
the propagation of its k’th block, and until the attacker’s (k+1)’th block propagated 
throughout the network. 

Assume by way of negation that ρ_h ≥ ρ_k. If k is large enough, the following 
relation holds: 

$$
\begin{aligned}
& q\,(1-\rho_k)\,(k+1) - (1-q)\,\rho_k\,(k+1) - (-\rho_h\, k) \;\ge && (10)\\
& q\,(1-\rho_k)\,(k+1) - (1-q)\,\rho_k\,(k+1) + \rho_k\, k \;= \\
& (k+1)\, q - \rho_k \;>\; 0. && (11)
\end{aligned}
$$

This implies that the expected rewards of π_k, resulting from state (k − 1, k) being 
reached, exceed those of honest mining upon reaching this state. Since this is 
the only state in which these strategies differ, the inequality above implies that π_k 
strictly dominates honest mining, thus ρ_k = REV(π_k) > REV(honest mining) = 
ρ_h. We conclude that any attacker can benefit from deviating in some states from 
honest mining, hence that the profit threshold vanishes. 

□ 

It can be shown that this is actually decided in finite time, in expectation [?]. 

The intuition behind this result is clear: The attacker suffers a significant loss if 
it adopts in (k − 1, k), when k is large, and it thus prefers to continue the fork 
that formed naturally, and attempt to catch up. 
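
The algebra behind (10)-(11) can be checked with exact arithmetic. The sketch below evaluates the middle expression of the derivation and finds the smallest k for which (k + 1)·q − ρ_k > 0. The numeric values of q and ρ_k are hypothetical placeholders chosen only for illustration, and the function names are introduced here.

```python
from fractions import Fraction


def gain_over_honest(q, rho_k, k):
    """Middle expression of (10)-(11): a lower bound on the reward of pi_k minus
    that of honest mining at state (k - 1, k), using rho_h >= rho_k."""
    return q * (1 - rho_k) * (k + 1) - (1 - q) * rho_k * (k + 1) + rho_k * k


def smallest_profitable_k(q, rho_k):
    """Smallest natural k with (k + 1) * q - rho_k > 0."""
    k = 1
    while (k + 1) * q - rho_k <= 0:
        k += 1
    return k


q = Fraction(1, 100)     # hypothetical lower bound on the catch-up probability
rho_k = Fraction(3, 10)  # hypothetical relative revenue of pi_k
k = smallest_profitable_k(q, rho_k)

# The simplification in (11) holds exactly for any k.
assert gain_over_honest(q, rho_k, k) == (k + 1) * q - rho_k
print(k, gain_over_honest(q, rho_k, k))
```

However small q is, some finite k satisfies the inequality, which is the point of the proof sketch.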

This illustrates the importance of the policies found by Algorithm 1. As 
we have seen (Section 5), those dominate SMI in that they delay adoption, i.e., 
they allow the continuation of the attack even when the honest network’s chain 
is longer than the attacker’s. While the additional benefit was rather mild, this 
added feature becomes more important in networks with delays, where splits 
in the chain occur naturally with some probability, even when honest mining is 
practiced by all. 

To gain further understanding of selfish mining under delays it would be 
important to quantify the optimal gains from such deviations. We leave this as 
an open question for future research. Still, it is clear that Bitcoin will be more 
vulnerable to selfish mining if delays become more prominent, e.g., in the case 
of larger blocks (block size increases are currently being discussed within the 
Bitcoin developers community). 

7 Effect on Double Spending Attacks 

In this section we discuss the qualitative effect selfish mining has on the security 
of payments. The regular operation of Bitcoin transactions is as follows: A 
payment maker signs a transaction and pushes it to the Bitcoin network, then 
nodes add it to the blocks they are attempting to create. Once a node succeeds it 
publishes the block with its content. Although the payee can now see this update 
to the public chain of blocks, it still waits for it to be further extended before 
releasing the good or service paid for. This deferment of acceptance guarantees 
that a conflicting secret chain of blocks (if one exists) will not be able to bypass and 
override the public one observed by the payee, thereby discarding the transaction. 
Building a secret chain in an attempt to reverse payments is called a double 
spending attack. 

Success-probability. Satoshi Nakamoto, in his original white paper, provides 
an analysis regarding double spending in probabilistic terms: Given that the 
block containing the transaction is followed by n subsequent blocks, what is the 
probability that an attacker with computational power α will be able to override 
this chain, now or in the future? Nakamoto showed that the success-probability 
of double spending attacks decays exponentially with n. Alternative and perhaps 
more accurate analyses exist, see [15], [?]. 
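
For reference, Nakamoto's calculation can be sketched as follows. The formula is the one given in the Bitcoin white paper, not a result of this paper; in the white paper's notation q is the attacker's share (α here) and z is the number of confirmations (n here). The function name is illustrative.

```python
import math


def nakamoto_success_probability(q, z):
    """Probability that an attacker with hash share q ever overtakes a chain
    that is z blocks ahead, following the white paper's Poisson approximation."""
    p = 1.0 - q
    if q >= p:
        return 1.0                      # a majority attacker always catches up
    lam = z * q / p                     # expected attacker progress while z blocks are mined
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total


for z in (0, 1, 2, 5, 10):
    print(z, f"{nakamoto_success_probability(0.1, z):.7f}")
```

The exponential decay in the number of confirmations is visible directly in the printed values.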

Cost. While a single double spending attack succeeds with negligible probability 
(as long as the payee waits long enough), regrettably, an attacker that continuously 
executes double spending attempts will eventually succeed (a.s.). We 
should therefore be more interested in the cost of an attack than in its success-
probability. Indeed, every failed double spending attack costs the attacker the 
potential award it could have gotten had it avoided the fork and published its 
blocks right away. 

Observe, however, that a smart strategy for an attacker would be to continuously 
employ selfish mining attacks, and upon success combine them with 
a double spending attack. Technically, this can be done by regularly engaging 
in public transactions, while always hiding a conflicting one in the attacker’s 
secret blocks. There is always some probability that by the time a successful 
selfish mining attack has ended, the payment receiver has already accepted the 
payment, which additionally results in a successful double spending. 

To summarize, the existence of a miner for which selfish mining is at least as 
profitable as honest mining fundamentally undermines the security of payments, 
as this attacker bears no cost for continuously attempting to double spend, 
and it eventually must succeed. Similarly, an attacker that cannot profit from 
selfish mining alone might be profitable in the long run if it combines it with 
double spending, which potentially has grave implications on the profit threshold. 


8 Related Work 

The Bitcoin protocol was introduced in a white paper published in 2008 by 
Satoshi Nakamoto [?]. In the paper, Nakamoto shows that the blockchain is 
secure as long as a majority of the nodes in the Bitcoin network follow the 
protocol. Kroll et al. [11] show that, indeed, always extending the latest block 
in the blockchain forms a (weak, non-unique) Nash equilibrium, albeit under a 
simpler model that does not account for block withholding. 

On the other hand, it has been suggested by various people in the Bitcoin forum 
that strong nodes might be incentivized to violate the protocol by withholding 
their blocks [1]. Eyal and Sirer proved this by formalizing a block withholding 
strategy SMI and analyzing its performance [9]. Their strategy thus violates the 
protocol’s instruction to immediately publish one’s blocks, but still sticks to the 
longest-chain rule (save a selective tie breaking): SMI still abandons its chain 
if the honest nodes create a longer chain. One result of our paper is that even 
adhering to the longest-chain rule is not a best response. We also prove what 
the optimal policies are, and compute the threshold under which honest mining 
is a (strict, unique) Nash equilibrium. Additional work on selfish mining via 
block withholding appears in [5]. Transaction propagation in Bitcoin has also 
been analyzed from the perspective of incentives. Results in [5] show that nodes 
have an incentive not to propagate transactions, and suggest a mechanism to 
correct this. Additional analysis from a game theoretic perspective has also been 
conducted with regard to interactions between pools, either from a cooperative game 
theory perspective [12], or when considering attacks between pools [8]. 

In the worst case, the attacker is frequently engaged in “real” transactions anyway, 
hence suffers no loss from them being occasionally confirmed, when attacks fail. 



A recent paper by Gobel et al. has evaluated SMI in the presence of delays [?]. 
They show that SMI is not profitable under a model of delays that 
greatly differs from our own (in particular, they assume that block transmission 
occurs as a memoryless process). While SMI may indeed be unprofitable when 
delay is modeled, we show that other profitable selfish mining attacks exist. 
Additional analysis of block creation in the presence of delays and its effects on 
throughput and double spending appears in [?], [?]. 

Further discussion on Bitcoin’s stability can be found in a recent survey by 
Bonneau et al. [?]. 

A Generality of the Model 

As mentioned in Section [5J the most general setup would be for an attack- 
strategy to consider also building its blocks in different places in the block-tree 
(say, extending a previously abandoned chain, or adopting a subchain of the 
public chain) and/or to publish more than h + 1 blocks upon overriding the 
public honest chain. It is clear, intuitively, why such actions are suboptimal. 
Below we make this formal. 

Let π be an optimal strategy, when the above actions are available to the 
attacker as well. 

Part I: Assume there exists a state (a, h) where the attacker publishes h + j 
blocks with j > 1; we denote this by π(a, h) = “override by j”. We now construct 
a policy π', which follows π everywhere except that π'(a, h) = wait. By 
Corollary 10 it suffices to show that $v_\rho^{\pi'} \ge v_\rho^{\pi} = 0$, where ρ is the relative revenue 
induced by π; this will imply that no reduction in REV occurs when switching 
from “override by j” to wait. 

For every state X, denote by $v_\rho^{\pi}(X)$ the expected value of $(1-\rho)\cdot E[R^{1,1}(\pi)] - 
\rho\cdot E[R^{2,1}(\pi)]$ conditioned on arriving at state X (recall that τ_1 is the terminating 
state of the first run). We need to show that $v_\rho^{\pi'}(a,h) \ge v_\rho^{\pi}(a,h)$, as this is the 
only state where these policies differ. Observe that $v_\rho^{\pi}(a,h) = (1-\rho)\cdot j + v_\rho^{\pi}(a-j, 0)$. 
This is because either j < a, and this action cannot lead to a termination 
(hence the addition of $v_\rho^{\pi}(a-j, 0)$), or j = a, and then $v_\rho^{\pi}(a-j, 0) = v_\rho^{\pi}(0, 0) = 0$, 
which fits the fact that a termination occurred. We thus need to show that 
$v_\rho^{\pi'}(a,h) \ge (1-\rho)\cdot j + v_\rho^{\pi}(a-j, 0)$. 

Indeed, consider the case where π' performs “override by j” if X = (a, h+1) 
or “override by (j + 1)” if X = (a+1, h). Note that the action in the first case is 
feasible, since a > h + j > h + 1, and similarly a + 1 > h + j + 1 > h, for the second 
case. In the former case we obtain $v_\rho^{\pi'}(a, h+1) = (1-\rho)\cdot j + v_\rho^{\pi}(a-j, 0)$, and in 
the latter, $v_\rho^{\pi'}(a+1, h) = (1-\rho)\cdot(j+1) + v_\rho^{\pi}(a+1-(j+1), 0)$. Therefore, we have 
presented an action-scheme which guarantees π' the value of π. As π (hence π') 
optimizes the value $v_\rho(X)$, for any X, we have that the value of π (hence of π') 
in the states (a+1, h) and (a, h+1) is at least as high as $(1-\rho)\cdot j + v_\rho^{\pi}(a-j, 0)$, 
which completes this part of the proof. 

Part II: We claimed, additionally, that the attacker will never adopt branches 
in the block-tree other than its current secret one and the honest faction’s current 
longest one. We now aim to justify this assertion, albeit with some informalities; 
a formal proof is not possible under our model, because it implicitly 
assumes that actions such as override and adopt grant immediate reward, whereas if 
the attacker adopts older abandoned chains it can hypothetically reverse such 
decisions. Nonetheless it is very clear why this would be suboptimal: 

For any (a, h), let A_1, ..., A_a denote the attacker’s chain, and H_1, ..., H_h the 
honest network’s chain, and let H_0 be the block that A_1 and H_1 extend (it is now 
public, but may have belonged to the attacker). Let now (a, h) be the first state 
at which the attacker decides to deviate and extend a block B other than A_a or 
H_h. If B was not created after H_0 (and B ≠ H_0), then it was available to the 
attacker at the time it began extending H_0. By the choice of (a, h), extending H_0 
was then at least as profitable as extending B, and this dominance is invariant 
under future events (e.g., by the public chain that formed above H_0). Thus the 
attacker can just as well repeat its initial choice of H_0 over B. 

A similar argument holds for the case where B was created after H_0 (or 
B = H_0). Denote by l the length of the attacker’s chain upon the creation of B. 
Extending A_l was then at least as profitable as extending B, by the choice of 
(a, h), and this again is not altered by future events. All the same, the attacker 
can just as well repeat its choice and choose A_l over B. In conclusion, we can 
restrict our attention to strategies restricted to our three-action model (four, 
with wait), without loss of generality. This also enables a Markovian model, 
fortunately, as described in Section 2. 


B Proof of Proposition 1 

Proposition 1. For any π, REV(π, α, γ) ≤ α/(1 − α). Moreover, this bound is tight, 
and achieved when γ = 1. 

Proof. We can map every block of the honest network which was overridden to 
a block of the attacker; this is because override requires the attacker to publish 
a chain longer than that of the honest network. 

Let $k_T$ be the number of blocks that the attacker has built up to time T. 
The honest network thus built $l_T := T - k_T$ by this time. The argument above 
shows that $l_T - \sum_{t=1}^{T} r_t^2 \le k_T$. Also, $\Pr(l_T > k_T) \to 1$ when $T \to \infty$. Therefore, 
the relative revenue satisfies: 

$$
REV(\pi) = \lim_{T\to\infty} \frac{\sum_{t=1}^{T} r_t^1}{\sum_{t=1}^{T} r_t^1 + \sum_{t=1}^{T} r_t^2}
\;\le\; \lim_{T\to\infty} \frac{\sum_{t=1}^{T} r_t^1}{\sum_{t=1}^{T} r_t^1 + l_T - k_T} = \tag{12}
$$

$$
\lim_{T\to\infty} \frac{1}{1 + (l_T - k_T)/\left(\sum_{t=1}^{T} r_t^1\right)}
\;\le\; \lim_{T\to\infty} \frac{1}{1 + (l_T - k_T)/k_T}
= \lim_{T\to\infty} \frac{k_T}{l_T}. \tag{13}
$$

The SLLN applies naturally to $k_T$ and $l_T$, implying that the above equals 
α/(1 − α) (a.s.). 

To see that the bound is achieved in γ = 1, observe that the policy SMI 
satisfies the property that every block of the attacker overrides one block of the 
honest network, and that none of the attacker’s blocks are overridden (as the 
policy never reaches a state where it needs to adopt, except when a = 0). This 
turns both inequalities in (12)-(13) into equalities. □ 

In case the honest network is forked, pick one of them arbitrarily; blocks are 
anonymous, and they are only accepted or rejected according to the lengths of their chains, 
which are in this case equal. 


C Correctness of Algorithm 1 

In this section we prove that Algorithm 1 halts and that its output meets the 
conditions specified therein. We begin by applying a Strong Law of Large 
Numbers, which will prove useful along our path. Under a fixed stationary policy 
π, we denote by τ_1 the renewal time of the game. Formally, τ_1 is the time, 
or number of visited states, until the game reaches a state s from which the 
transition probabilities are α to state (1, 0) and 1 − α to (0, 1). 

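Lemma 8. Let π be some fixed policy of $M_\rho^{T_0}$. Denote $R^{k,1}(\pi) := \sum_{t=1}^{\tau_1} r_t^k(\pi)$ 
(for k = 1, 2). Then 

$$
\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^k(\pi)
= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^k(\pi)\right]
= \frac{E\left[R^{k,1}(\pi)\right]}{E[\tau_1]} \quad \text{(a.s.)}, \tag{14}
$$

for k = 1, 2. Similarly, 

$$
\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} w_\rho\left(r_t(\pi)\right)
= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} w_\rho\left(r_t(\pi)\right)\right] \tag{15}
$$

$$
= \frac{(1-\rho)\cdot E\left[R^{1,1}(\pi)\right] - \rho\cdot E\left[R^{2,1}(\pi)\right]}{E[\tau_1]} \quad \text{(a.s.)} \tag{16}
$$
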


Proof. Denote by $C^\pi$ the states reachable from state s_0 := (1, 0, irrelevant), when 
π is employed. We will show that $C^\pi$ is an irreducible positive recurrent Markov 
chain. For any state X in $C^\pi$ it must be that the waiting time for the next 
visit of s_0 has finite expectation: Assume that after T' steps the honest network 
created M(T') blocks and the attacker m(T'). If M(T') > m(T') then as long 
as the player does not adopt, h − a = M(T') − m(T'); this is regardless of other 
actions which the attacker possibly made in the past. As block creations are 
i.i.d., the process Y(T') = M(T') − m(T') is equivalent to a random walk on 
Z with a positive drift, hence the expected time of its last return to 
the origin is finite. After that, the only actions the attacker can make are adopt 
and wait. As our model does not allow for pathological strategies in which the 
attacker waits for periods of infinite expected length, the next adoption occurs 
in finite expected time. Finally, every adoption leads to s_0 with probability α, 
thus the next return to s_0 is of finite expectation. This state is thus positive 
recurrent. We conclude that $C^\pi$ consists of a single communicating class (the 
finite expectation of the return implies the existence of a t for which there is a 
positive probability to return to s_0 within t steps), hence that π induces a single 
irreducible Markov chain $C^\pi$, which is also positive recurrent, as s_0 is. We can 
thus use the Strong Law of Large Numbers for Markov chains (see, e.g., [?], pg. 
50, Corollary 79) to arrive at (14) and (15)-(16). The right-hand side equality in each 
follows from the SLLN applied to renewal reward processes. □ 


The following are immediate corollaries of the strong law above: 

Corollary 9. For any admissible policy π of $M_\rho^{T_0}$, 

$$
REV(\pi) = \lim_{T\to\infty} \frac{\sum_{t=1}^{T} r_t^1(\pi)}{\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)}
= \frac{E\left[R^{1,1}(\pi)\right]}{E\left[R^{1,1}(\pi)\right] + E\left[R^{2,1}(\pi)\right]} \quad \text{(a.s.)} \tag{17}
$$

Corollary 10. Let π and π' be two policies. 

1. If $(1-\alpha)\cdot E[R^{1,1}(\pi)] - \alpha\cdot E[R^{2,1}(\pi)] \ge 0$, then π dominates honest mining. 

2. If $(1 - REV(\pi))\cdot E[R^{1,1}(\pi')] - REV(\pi)\cdot E[R^{2,1}(\pi')] \ge 0$, then π' dominates π. 

Both assertions become strict together with the inequalities. 

The following lemma states that an optimal policy in $M_\rho^{T_0}$, whose value is 
small enough, is approximately optimal in M, if only truncated policies are 
considered: 

Lemma 11. Let ρ ∈ [0, 1], ε > 0, and T_0 ∈ N. If $\pi \in A^{T_0}$ is optimal in $M_\rho^{T_0}$ 
and $|v_\rho^{\pi}| < \varepsilon/2$, then 

1. $|\rho - REV(\pi)| < \varepsilon$ 

2. $|\rho - \max_{\pi' \in A^{T_0}} \{REV(\pi')\}| < \varepsilon$ 


Proof. Observe that $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi') + r_t^2(\pi')\right)$ represents the average 
number of blocks added to the agreed public chain (aka main chain), per round, when 
π' is deployed. Under the honest strategy, this rate equals 1, as every round 
accounts for the addition of a new block (see Section 2). On the other hand, no 
positive recurrent strategy can more than halve the growth rate of the main 
chain: For every block that is overridden and excluded from the main chain 
there is a corresponding overriding block that is included in it (see also the proof of 
Proposition 1). Thus, 

$$
E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi') + r_t^2(\pi')\right)\right] \ge 1/2.
$$

This assumption is without loss of generality, as at some point the player would 
need to adopt, and the waiting time for it is finite in expectation. See the proof of 
Lemma 8. 







Part I: Relying on Lemma 8 we can manipulate the limits to obtain 

$$
\varepsilon/2 > v_\rho^{\pi}
= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left((1-\rho)\cdot r_t^1(\pi) - \rho\cdot r_t^2(\pi)\right)\right]
= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^1(\pi)\right]
- \rho\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right]. \tag{18}
$$

Using Corollary 9 we obtain 

$$
REV(\pi)
= \frac{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^1(\pi)\right]}{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right]}
< \rho + \frac{\varepsilon/2}{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right]}
\le \rho + \varepsilon.
$$

Similarly, $v_\rho^{\pi} > -\varepsilon/2$ implies 

$$
REV(\pi) > \rho - \frac{\varepsilon/2}{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right]} \ge \rho - \varepsilon,
$$

which concludes the first part. 

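Part II: We use here the same technique as previously. Assume by negation that 
for some policy $\pi' \in A^{T_0}$, $REV(\pi') > \rho + \varepsilon$. Then, similar to the previous part, 
we have 

$$
\frac{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^1(\pi')\right]}{E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi') + r_t^2(\pi')\right)\right]}
= REV(\pi') > \rho + \varepsilon \tag{19}
$$

$$
\Longrightarrow\quad v_\rho^{\pi'} > \varepsilon\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi') + r_t^2(\pi')\right)\right] \ge \varepsilon/2 > v_\rho^{\pi}, \tag{20}
$$

which contradicts the optimality of $v_\rho^{\pi}$. This proves that $\rho \ge \max_{\pi' \in A^{T_0}} \{REV(\pi')\} - \varepsilon$. 
On the other hand, assume in negation that $REV(\pi) < \rho - \varepsilon$. We then have, 

$$
v_\rho^{\pi} = \left(REV(\pi) - \rho\right)\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right] < -\varepsilon/2,
$$

and we arrive again at a contradiction. Therefore, $REV(\pi) \ge \rho - \varepsilon$, hence 

$$
\max_{\pi' \in A^{T_0}} \{REV(\pi')\} \ge \rho - \varepsilon.
$$

□ 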


Corollary 12. If π is ε/4-optimal in $M_\rho^{T_0}$ and $|v_\rho^{\pi}| < \varepsilon/4$, then the inequalities 
guaranteed by Lemma 11 hold. 

Proof. The first inequality holds for π, as in its proof we did not use the assumption 
on π’s optimality. The second inequality is a property of ρ (and not of the 
policy); it holds because $|v_\rho^{\pi}| < \varepsilon/4$ together with π being ε/4-optimal imply 
$|v_\rho^{\pi^*}| < \varepsilon/2$, for an optimal policy π*. □ 

Finally, we are ready to prove the correctness of Algorithm 1. 

Proposition 4. 

For any T_0 ∈ N and ε > 0, Algorithm 1 halts, and its output (ρ, π) satisfies: 
$|\rho - REV(\pi)| < \varepsilon$ and $|\rho - \max_{\pi' \in A^{T_0}} \{REV(\pi')\}| < \varepsilon$. 

Proof. Observe that $v_\rho^{T_0 *}$, the optimal value of $M_\rho^{T_0}$, is monotonically decreasing 
in ρ: If ρ_1 > ρ_2 and π_1 is optimal in $M_{\rho_1}^{T_0}$ then $v_{\rho_2}^{T_0 *} \ge v_{\rho_2}^{\pi_1} > v_{\rho_1}^{\pi_1} = v_{\rho_1}^{T_0 *}$, 
where the strict inequality holds because $w_\rho$ is strictly decreasing. Furthermore, 
$v_\rho^{T_0 *}$ is continuous in ρ, as $w_\rho$ is. 

Now, the quantity (high − low) is halved at every iteration of the loop, 
hence the number of iterations must be finite. To understand 
what we can say about v when the algorithm halts and high − low < ε/8, we 
make use of loop invariants: First, we claim that for every value assigned to low 
throughout the algorithm’s run, the value returned by mdp_solver($M_{low}^{T_0}$, ε/8) is 
positive. Indeed, low begins with a value of 0. Honest mining gains the attacker 
a value of α in $M_0^{T_0}$; mdp_solver($M_0^{T_0}$, ε/8) thus returns a positive value, 
assuming ε < 8·α. Any further alteration of low’s value, in line 6, is conditioned 
to satisfy this assertion. 

Similarly, the value returned by mdp_solver($M_1^{T_0}$, ε/8) must be negative, 
since the attacker’s profits for its blocks vanish, and its revenue for blocks 
it adopts is negative (and such events occur in finite time, in expectation; see 
Lemma 8). In addition, any new assignment to high is conditioned to be 
non-positive by the algorithm. 













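From the monotonicity and continuity of $v_\rho^{T_0 *}$ we deduce that the root of 
$v_\rho^{T_0 *}$ lies between low and high. However, high − low < ε/8 implies that |v| < ε/8: 
Indeed, assume in negation that v > ε/8. Then 

$$
\begin{aligned}
\varepsilon/8 < v_\rho
&= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^1(\pi)\right]
 - \rho\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right] \\
&= E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t^1(\pi)\right]
 - (\rho + \varepsilon/8)\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right]
 + \varepsilon/8\cdot E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right] \\
&\le v_{\rho+\varepsilon/8}^{T_0 *} + \varepsilon/8 \;<\; v_{high}^{T_0 *} + \varepsilon/8.
\end{aligned} \tag{21}
$$

We used here the inequality $E\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \left(r_t^1(\pi) + r_t^2(\pi)\right)\right] \le 1$ (see the proof 
of Lemma 11), and the strict monotonicity of $v_\rho^{T_0 *}$. This contradicts $v_{high}^{T_0 *} \le 0$, 
which holds as a loop invariant. A similar derivation rules out the case v < −ε/8. 
We conclude that |v| < ε/8. □ 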
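
The bisection argument above can be illustrated on a toy problem. The sketch below is not Algorithm 1 itself: instead of an MDP solver it takes a hypothetical finite list of policies, each summarized by its expected per-cycle rewards (E[R^{1,1}], E[R^{2,1}]), and computes the optimal value by brute force. It shows why the root of the (decreasing) optimal value in ρ coincides with the best achievable relative revenue. All names and numbers are illustrative.

```python
def optimal_value(policies, rho):
    """Toy stand-in for mdp_solver: best value of (1 - rho) * R1 - rho * R2
    over a finite list of (R1, R2) pairs, one pair per policy."""
    return max((1 - rho) * r1 - rho * r2 for r1, r2 in policies)


def bisect_rho(policies, eps):
    """Binary search for the rho at which the optimal value crosses zero."""
    low, high = 0.0, 1.0
    while high - low >= eps / 8:
        rho = (low + high) / 2
        if optimal_value(policies, rho) > 0:
            low = rho    # some policy still earns more than a rho share
        else:
            high = rho   # no policy can sustain a relative revenue of rho
    return (low + high) / 2


# Hypothetical policies with relative revenues 1/4, 2/5 and 1/2.
toy_policies = [(1.0, 3.0), (2.0, 3.0), (1.0, 1.0)]
rho_star = bisect_rho(toy_policies, eps=1e-6)
best_ratio = max(r1 / (r1 + r2) for r1, r2 in toy_policies)
print(rho_star, best_ratio)
```

Each policy contributes a line R1 − ρ·(R1 + R2) that vanishes exactly at its own relative revenue, so the maximum over policies vanishes at the largest one.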


Proposition 3. For any T ∈ N, if $v_\rho^{T*} \ge 0$ then $u_\rho^{T*} \ge v_\rho^{*} \ge v_\rho^{T*}$. Moreover, 
these bounds are tight: $\lim_{T\to\infty} u_\rho^{T*} - v_\rho^{T*} = 0$. 

We precede the proof of the proposition with some (fun!) probability analysis. 
Denote by Low-Triangle the set of states {(a, h) : a ≥ h}. Fix some policy 
π and a state (a_0, h_0). Denote by (Y_t) the random process defined by our 
game, where the initial state is (a_0, h_0). Let ψ be a stopping time defined by 
max{t : Y_t ∈ Low-Triangle}. If Y_ψ = (a_1, a_1) (observe that Y_ψ must lie in 
the main diagonal), we denote last(a_0, h_0) := a_1. last(a_0, h_0) represents the 
number of blocks the attacker (or the honest network, for that matter) has, before 
leaving Low-Triangle for the rest of the epoch. 


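Lemma 13. For any state $(a_0, h_0) \in$ Low-Triangle,

$$E[last(a_0, h_0)] = \frac{\alpha \cdot (1-\alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{1}{2}\left(\frac{a_0 - h_0}{1 - 2 \cdot \alpha} + a_0 + h_0\right) \tag{22}$$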

Proof. Note first that $last(a_0, h_0) = \frac{1}{2} \cdot (\psi - (a_0 - h_0)) + a_0$, because if the
attacker created $k$ blocks after reaching $(a_0, h_0)$, the honest network needs to
create precisely $k + a_0 - h_0$ blocks in order to leave Low-Triangle. We are thus
left with the task of calculating $E[\psi]$. Consider a random walk on $\mathbb{Z}$, starting
at $a_0 - h_0$, with probability $\alpha$ of moving one step towards positive infinity and
$(1 - \alpha)$ of moving towards negative infinity. Let $\psi'$ be the time until the last
visit of the origin. Observe that $\psi'$ has the same distribution as $\psi$ (!); we thus
identify them with each other henceforth.

We further break $\psi$ into stopping times: Let $N$ be the number of visits to the
origin (we have $N > 0$ almost surely, since the drift is towards negative infinity).
Let $\psi_1$ be the time up to the first visit of the origin, and for $1 < k \le N$, let
$\psi_k$ be the time that elapsed between $\psi_{k-1}$ and the next visit to the origin. Any
two travels that begin and end at the origin are i.i.d., and, moreover, the number
of such travels is independent of their lengths. Therefore, by Wald's equation,
$E[\psi - \psi_1] = E[N] \cdot E[\psi_2]$.

We can interpret $N$ as counting the number of failures before one success,
where a success represents a visit of the origin which never returns to it (this is
equivalent, almost surely, to never returning to the nonnegative side of $\mathbb{Z}$). The
probability of a success is $\left(1 - \frac{\alpha}{1-\alpha}\right)$, implying that $E[N] = \frac{\alpha}{1 - 2 \cdot \alpha}$.

Whenever the walk starts at $+1$ the expected return time to the origin is
$\frac{1}{1 - 2 \cdot \alpha}$. The same expression holds for the expected return time when starting at
$-1$, conditioned on a return occurring (see [IB]). Counting the first step to $\pm 1$
as well, the expected next return to the origin, conditioned on its occurrence, is
$\left(1 + \frac{1}{1 - 2 \cdot \alpha}\right)$. We conclude that $E[\psi - \psi_1] = \frac{\alpha}{1 - 2 \cdot \alpha} \cdot \left(1 + \frac{1}{1 - 2 \cdot \alpha}\right)$.

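Another result in [TB] implies that $E[\psi_1] = \frac{a_0 - h_0}{1 - 2 \cdot \alpha}$. We obtain:

$$E[\psi] = \frac{\alpha}{1 - 2 \cdot \alpha} \cdot \left(1 + \frac{1}{1 - 2 \cdot \alpha}\right) + \frac{a_0 - h_0}{1 - 2 \cdot \alpha} \tag{23}$$

$$E[last(a_0, h_0)] = \frac{1}{2} \cdot \left(E[\psi] - (a_0 - h_0)\right) + a_0 = \frac{1}{2} \cdot \left(E[\psi] + a_0 + h_0\right) = \tag{24}$$

$$\frac{1}{2} \cdot \left(\frac{\alpha}{1 - 2 \cdot \alpha} \cdot \left(1 + \frac{1}{1 - 2 \cdot \alpha}\right) + \frac{a_0 - h_0}{1 - 2 \cdot \alpha} + a_0 + h_0\right) = \tag{25}$$

$$\frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{1}{2} \cdot \left(\frac{a_0 - h_0}{1 - 2 \cdot \alpha} + a_0 + h_0\right) \tag{26}$$

□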
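
As a quick check of the algebra leading from (23) to (22), the two closed forms can be evaluated exactly with rational arithmetic. The sketch below only evaluates the expressions as stated in the lemma and its proof; the function names are hypothetical, and `alpha` is assumed to be a `Fraction` strictly below 1/2.

```python
from fractions import Fraction


def expected_psi(alpha: Fraction, a0: int, h0: int) -> Fraction:
    """E[psi] as stated in (23)."""
    d = 1 - 2 * alpha
    return alpha / d * (1 + 1 / d) + (a0 - h0) / d


def expected_last(alpha: Fraction, a0: int, h0: int) -> Fraction:
    """E[last(a0, h0)] as stated in (22)."""
    d = 1 - 2 * alpha
    return alpha * (1 - alpha) / d**2 + Fraction(1, 2) * ((a0 - h0) / d + a0 + h0)


alpha = Fraction(1, 4)
# Identity used in (24): E[last] = (E[psi] + a0 + h0) / 2.
lhs = Fraction(1, 2) * (expected_psi(alpha, 3, 1) + 3 + 1)
rhs = expected_last(alpha, 3, 1)
```

For $\alpha = 1/4$ and $(a_0, h_0) = (3, 1)$ both sides evaluate to 19/4.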


Proof (of Proposition 3). Part I: Let $\pi$ be an optimal policy in $M_\rho$. Assume the
game has reached state $(a, h) \in$ Low-Triangle, and an oracle lets the attacker
know that this is the last state in Low-Triangle which the game will reach
before a future adopt. Assume further that the oracle lets the attacker "cheat" and
perform the action match with success probability 1 (granting him, effectively,
$\gamma = 1$), even if the previous state was not in Low-Triangle (thus ignoring
restrictions on the feasibility of match). Obviously, the attacker can only benefit
from this oracle, by waiting for the last state in Low-Triangle, and then
performing match (it has nothing to lose by taking only the null action up to that
point).

Upon which, performing match on the main diagonal marks the end of the
first epoch, since the respective chains of the attacker and the honest network
collapse, hence the next state is distributed as Xp is. As a result, we may bound
the accumulated immediate rewards from state $(a, h) \in$ Low-Triangle onwards,
up to $\tau_1$, in $M_\rho$, by

$$(1 - \rho) \cdot E[last(a, h)] = (1 - \rho) \cdot \left(\frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{1}{2}\left(\frac{a - h}{1 - 2 \cdot \alpha} + a + h\right)\right). \tag{27}$$

Footnote: Note that, starting at +1, the expected return time is unaffected by conditioning on
an eventual return, since this occurs with probability 1.




















This is precisely the reward given in state $(a, h) \in$ Low-Triangle with $a = T$, in
the over-paying MDP $N^T_\rho$.

We follow the same approach to bound the accumulated rewards from states
$(a, h) \notin$ Low-Triangle. Assume that the oracle tells the attacker whether it
will ever return to the main diagonal (without adopting first) or not. Clearly, if
the oracle carries the negative message, the attacker is better off adopting right
away, minimizing its negative reward. This will imply a reward of $-\rho \cdot h$. On
the other hand, if the oracle says the process will eventually return to the main
diagonal, the attacker is better off waiting for that event. If we denote by $(a_0, a_0)$
the next arrival at Low-Triangle (which is necessarily on the main diagonal),
then $E[a_0 \mid \text{return occurs}] = \frac{h - a}{1 - 2 \cdot \alpha}$ ([II]).

Upon which the attacker's future rewards up to $\tau_1$ are bounded from above
by $(1 - \rho) \cdot E[last(a_0, a_0)] = (1 - \rho) \cdot \left(\frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + a_0\right)$, by (22). Since this is linear
in $a_0$, we conclude that the expected reward from state $(a, h) \notin$ Low-Triangle,
conditioned on returning to the Low-Triangle, is upper bounded by
$(1 - \rho) \cdot \left(\frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{h - a}{1 - 2 \cdot \alpha}\right)$. The probability of this event is $(\alpha/(1 - \alpha))^{h-a}$. All in all, the
attacker's rewards from state $(a, h) \notin$ Low-Triangle onward are upper bounded
by

$$\left(1 - \left(\frac{\alpha}{1 - \alpha}\right)^{h-a}\right) \cdot (-\rho \cdot h) + \left(\frac{\alpha}{1 - \alpha}\right)^{h-a} \cdot (1 - \rho) \cdot \left(\frac{\alpha \cdot (1 - \alpha)}{(1 - 2 \cdot \alpha)^2} + \frac{h - a}{1 - 2 \cdot \alpha}\right). \tag{28}$$

This, again, is exactly the reward given in the over-paying $N^T_\rho$ when state $(a, h) \notin$
Low-Triangle with $h = T$ is reached.
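
The return probability $(\alpha/(1 - \alpha))^{h-a}$ is the classical gambler's-ruin quantity for a biased walk, and a small Monte Carlo sketch can sanity-check it. The function name, the number of trials, the seed, and the cutoff `floor` (a walk that drifts that far below its start is treated as never returning, which is an approximation) are all assumptions made for this illustration.

```python
import random


def return_probability(alpha, gap, trials=20000, floor=40, seed=0):
    """Estimate P(a walk started `gap` below the origin ever reaches it).

    Steps are +1 with probability alpha and -1 otherwise. A walk that falls
    `floor` further below its start is counted as never returning.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos = -gap
        while -gap - floor < pos < 0:
            pos += 1 if rng.random() < alpha else -1
        hits += pos == 0  # True counts as 1 when the origin was reached
    return hits / trials


alpha, gap = 0.3, 2
estimate = return_probability(alpha, gap)
closed_form = (alpha / (1 - alpha)) ** gap
```

Here `gap` plays the role of $h - a$; the estimate should agree with the closed form up to sampling noise.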

Part II: Recall the result of Lemma 8:


E[ri] 


(29) 


When the optimal (in $M_\rho$) $\pi$ is applied in $N^T_\rho$, with an adopt in the truncating
states, the expected epoch time cannot be greater. Therefore, if $v_\rho = v^{*}_\rho > 0$,
this transformation can only increase $v_\rho$. We conclude that if $\pi$ is an optimal policy
in $M_\rho$, then the expected average value of (the truncated version of) $\pi$ in $N^T_\rho$
upper bounds $v_\rho = v^{*}_\rho$. An optimal policy of $N^T_\rho$ can only do better, hence
$u^{T*}_\rho \ge v^{*}_\rho$, which concludes the involved part of the proof.

Part III: That $v^{*}_\rho \ge v^{T*}_\rho$ is trivial, since any policy that is feasible in $M^T_\rho$ is
feasible in $M_\rho$, and the rewards are identical. Finally, we show that the gap
between $u^{T*}_\rho$ and $v^{T*}_\rho$ vanishes. First, observe that the reward from visiting a state
$(a, h) \notin$ Low-Triangle, given in (28), converges to $-\rho \cdot h$. Thus, as $T$ goes to
infinity, the reward from these states in $N^T_\rho$ converges to that of $M^T_\rho$. On the other
hand, the probability to reach a state in Low-Triangle vanishes exponentially with
$T$ (e.g., by applying Chernoff's bound). The reward given in $N^T_\rho$ in the truncating
states of the form $(T = a \ge h)$ grows only linearly in $T$ (see (27): $a$ and $h$ are linear
in $T$). Therefore, the expected reward from these states (without conditioning on
reaching them) vanishes. We thus obtain $u^{T*}_\rho - v^{T*}_\rho \to 0$, as $T$ goes to infinity. □

Footnote: Recall that the attacker is forced to adopt at some stage, as we've seen before.
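
To make the convergence claim concrete, the truncating-state rewards of the over-paying MDP can be written as a single function. This is a sketch that simply transcribes (27) and (28); the function name and the choice to branch on $a \ge h$ are assumptions, and the parameter values are arbitrary.

```python
def overpay_reward(alpha, rho, a, h):
    """Truncating-state reward in the over-paying MDP, per (27) and (28)."""
    d = 1 - 2 * alpha
    c = alpha * (1 - alpha) / d**2
    if a >= h:  # state in Low-Triangle: bound (27)
        return (1 - rho) * (c + 0.5 * ((a - h) / d + a + h))
    # state outside Low-Triangle: bound (28)
    p_return = (alpha / (1 - alpha)) ** (h - a)
    return (1 - p_return) * (-rho * h) + p_return * (1 - rho) * (c + (h - a) / d)


# Distance between (28) and the plain adopt reward -rho * h as the honest
# lead h - a grows (here a = 0).
gaps = [abs(overpay_reward(0.3, 0.4, 0, h) + 0.4 * h) for h in (5, 20, 80)]
```

The entries of `gaps` shrink geometrically in $h - a$, which is the behaviour the argument above relies on.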

Corollary 14. If $v^{T*}_\rho > 0$ then $\rho + 2 \cdot u^{T*}_\rho \ge \max_{\pi' \in A} \{REV(\pi')\}$.

Proof. The proofs in Appendix [C] did not use the truncation of the process.
We can therefore follow the same steps as in the proof of Lemma 11, Part II:
Put $\varepsilon = 2 \cdot u^{T*}_\rho$. Then $v^{*}_\rho \ge v^{T*}_\rho > 0$, hence $u^{T*}_\rho \ge v^{*}_\rho$, by Proposition 3.
Similarly to the implication following (12)-(20), we can deduce that
$\rho + 2 \cdot u^{T*}_\rho \ge \max_{\pi' \in A} \{REV(\pi')\}$. □

Propositionj^ If $u$ and $\rho'$ are the outcome of the computation in Algorithm\^
lines [TMTM then $\rho' + 2 \cdot (u + \varepsilon') \ge \max_{\pi' \in A} \{REV(\pi')\}$.

Proof. If $low < \varepsilon/4$ then $\rho'$ is assigned the value 0. In this case, as shown above,
$v^{T*}_{\rho'} > 0$. Assume that $low \ge \varepsilon/4$. In the proof of Proposition 4 it was shown
that the value returned by mdp_solver($M^T_{low}$, $\varepsilon/8$) is positive. Therefore,
$v^{T*}_{low} > -\varepsilon/8$. Applying the proof of Lemma 11 we deduce that $v^{T*}_{low - \varepsilon/4} > \varepsilon/8$,
hence $v^{T*}_{\rho'} > 0$. Corollary 14 thus applies to $\rho'$, and we obtain
$\rho' + 2 \cdot u^{T*}_{\rho'} \ge \max_{\pi' \in A} \{REV(\pi')\}$. Observing that $u + \varepsilon' \ge u^{T*}_{\rho'}$ completes the proof. □

We complete the appendix with the proof of Corollary 6.

Corollary 6. Fix $\gamma$ and $\alpha$. If $u$ is the value returned by mdp_solver(N^, $\varepsilon$), and
$u < -\varepsilon$, then honest mining is optimal for $\alpha$. In other words, $\hat{\alpha}(\gamma) \ge \alpha$.

Proof. If $u < -\varepsilon$ then the value of the MDP given to mdp_solver is smaller than 0.
If we apply the same modification (of disabling honest mining) to $M_\alpha$, then
the value of the modified MDP cannot be positive (similarly to Proposition 3). However,
honest mining guarantees a value of 0 in $M_\alpha$, and we conclude that honest mining
(weakly) dominates other strategies. □




